First steps: set up the script with all the tools you need. This includes loading packages from your library of packages, using the library() command.
install.packages() function, which defaults to looking at the CRAN repository, where curated packages are maintained.update.packages() occasionally.install_github() function within the devtools package (which may be downloaded from CRAN).### Intro to basic R using the GapMinder data set
library(tidyverse)
### if not installed, highlight this and hit cmd-enter: install.packages(tidyverse)
### the tidyverse package is a super-package that loads the following
### packages, among others:
### * tidyr, with functions for data 'tidying'
### * dplyr, with functions for data manipulation and calculation
### * readr, with functions for reading/writing data in various formats
### * ggplot2, with functions for data visualization
library(stringr)
### This is part of the tidyverse package, but doesn't get loaded
### automatically with the 'library()' function. It has functions
### for working with strings.
library(gapminder) # install.packages(gapminder)
### this package contains a stripped-down dataset from GapMinder.
### protip: put these in the 'setup' chunk and they will load automatically
### when you first try to run anything in the script.For this exploration, we’ll use the data from the gapminder package loaded in the previous code chunk. There are other ways to get data into R, the most common being:
read.csv() from the base R package. Old school!read_csv() from the readr package (considerably faster generally than read.csv() for large data sets). Note that when you load the tidyverse package, this gets loaded automatically.readxl) that you may need to install first.### this pulls the data from the gapminder package into an object within the
### local environment to make it easier to access...
gap_df <- gapminderThere are some nice ways to inspect your data once you’ve loaded it into memory. Five that are particularly useful:
View() (or click on the variable name in the ‘Environment’ pane) for certain data types (classes) creates a tab here in the script window with an interactive view of your data.summary() is pretty self-explanatory. The output looks different depending on the type of data you are looking at. It does not show you what the data looks like in its basic form, but gives you an overview of the entire set.summary(gapminder)## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
head() which gives you just the first 6 (default) observations in the set. This shows you what the data actually looks like, in truncated form.head(gapminder)## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
glimpse() for data frames shows you each variable, its class, and the first few instances. This is built into the tidyverse package.glimpse(gapminder)## Observations: 1,704
## Variables: 6
## $ country <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
str() shows you the structure of the data. For complex data classes, this lets you see how it’s all put together.str(gapminder)## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
‘Tidy data’ is a philosophy promoted among the data science community, and promoted by Hadley Wickham, a stats professor, R guru, and Chief Scientist at R Studio (he designed and developed many of the ‘new’ packages including the tidyverse family of packages).
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
Looking at the data views above, it looks like this data is already pretty “tidy”, i.e. each row is a single observation and each column is a single variable. But for fun let’s make it “untidy” then tidy it back up again. We’ll create a new object from the gapminder data with an untidy version of the data. Since it is a data frame, so I’ll add a _df tag at the end to help differentiate it.
tidyr packageData tidying uses the tidyr package tools to reshape the data into a more tidy form (and into less tidy forms if you really insist), including:
gather() brings together data from several different variables (columns) into a single variable; a “key” column based on the original column names helps you distinguish them.spread() is the opposite of gather() - take a single column of data, and a “key” column of data information, and spread the data out into columns with names taken from the “key” column.separate() takes a single column with multiple values embedded in each observation (e.g. a name column with “Jane Doe”, “John Doe”, “Ebenezer Smythington”) and slice it into multiple columns (e.g. two columns, first name “Jane” etc. and last name “Doe” etc.)unite() is the opposite of separate() - takes multiple columns and glues them together into one.spread() to make our data wide format.For our example, let’s make columns for each year, containing the population variable, for some reason. This will be “wide” format (more columns) rather than “long” format (more rows). Tidy data is typically “long” format and the tidyverse functions work better in that format.
pop_df_wide <- gap_df %>%
select(country, continent, year, pop) %>%
spread(key = year, value = pop)The %>% is a ‘pipe’ operator which takes the output of one function and passes it into the next function as the first argument The tidyverse functions pretty much all take a dataframe (or tibble class as an updated version of data.frame class) as their first argument, so you can keep passing the dataframe along and modifying it in a more intuitive way, without needing temp variables along the way.
head(pop_df_wide)## # A tibble: 6 × 14
## country continent `1952` `1957` `1962` `1967` `1972`
## <fctr> <fctr> <int> <int> <int> <int> <int>
## 1 Afghanistan Asia 8425333 9240934 10267083 11537966 13079460
## 2 Albania Europe 1282697 1476505 1728137 1984060 2263554
## 3 Algeria Africa 9279525 10270856 11000948 12760499 14760787
## 4 Angola Africa 4232095 4561361 4826015 5247469 5894858
## 5 Argentina Americas 17876956 19610538 21283783 22934225 24779799
## 6 Australia Oceania 8691212 9712569 10794968 11872264 13177000
## # ... with 7 more variables: `1977` <int>, `1982` <int>, `1987` <int>,
## # `1992` <int>, `1997` <int>, `2002` <int>, `2007` <int>
Ugh, that is ugly and hard to work with!
The same functionality could have been accomplished like this:
pop_df <- select(gap_df, country, continent, year, pop)
pop_df_wide <- spread(pop_df, key = year, value = pop)
or nested like this:
pop_df_wide <- spread(select(gap_df, country, continent, year, pop), key = year, value = pop)
But please don’t do that. Harder for other audiences (including your future self, perhaps the most important audience) to read and interpret.
gather() to make data “long” formatpop_df_narrow <- pop_df_wide %>%
gather(key = year, value = pop, `1952`:`2007`)
### it's rare, but if your column headings are numbers (or have spaces) you
### can put the backtick marks around them.Now that we have some idea of the data format, let’s see if there are any questions we might be able to answer with it. Data visualization is hugely helpful for seeing relationships among variables, and the R package ggplot2 can create some nice plots quickly and reasonably intuitively.
Let’s start by looking at how per-capita GDP changes over time. How many countries and how many years are in the data set?
gap_df %>%
select(country) %>%
distinct() %>%
nrow()## [1] 142
gap_df$year %>%
unique() %>%
length()## [1] 12
142 countries plotted across 12 separate points in time - that’s a lot of points. Let’s try it anyway.
The Grammar of Graphics philosophy is a way of thinking about plots as a collection of data variables, mapped to various aesthetics such as color, shape, size, etc. These variables can be represented by different geometric objects - points, lines, bars, etc. The mapping of variables to geometries and aesthetics can be controlled by different scales - e.g. color scales, axis scales, etc. There’s a cool feature called facets that allow you to quickly create small multiples of a plot, each with a variation on the underlying data set. And on top of these, there are themes to control the mechanical aspects of the overall plot - text size/color, grid lines, etc.
Note the ggplot grammar involves ‘+’ instead of ‘%>%’ – think of it as each line adding new data and aesthetic mappings, geometric objects, scales, and facet specification. It’s building the plot one element at a time, rather than piping the plot from one function to the next.
See Hadley Wickham’s take on it here: http://vita.had.co.nz/papers/layered-grammar.pdf
First let’s take the gap_df dataframe and map the year variable to the x axis and the gdpPercap to the y axis. We will assign the data to a point geometry. Lines and bars (and other cool geometries) require a few more aesthetics to properly control them; points are nice and easy.
ggplot(data = gap_df, aes(x = year, y = gdpPercap)) +
geom_point()OK, over time seems like per capita GDP increases; there are a few outliers; but it’s hard to really separate out the components here. What if we aesthetically map the continent variable to a color aesthetic? (We could do countries, but then the legend fills the entire screen to represent colors for 142 countries).
ggplot(data = gap_df, aes(x = year, y = gdpPercap)) +
geom_point(aes(color = continent))We can also create facets to show each continent separately:
ggplot(data = gap_df, aes(x = year, y = gdpPercap)) +
geom_point(aes(color = continent)) +
facet_wrap( ~ continent)What is going on with Asia? we’ll come back to that.
It’s often desirable to modify the data by calculating new variables (e.g. using per capita GDP and population to calculate a country’s total GDP) or by aggregating variables into groups (e.g. finding the mean or max per capita GDP for all countries within a continent).
dplyr packageThe dplyr package provides tools to manipulate your data into more useful forms, including:
mutate() allows you to create new variables as combinations of other informationfilter() allows you to select rows by whether they match certain criteria.select() allows you to select, exclude, or rename columns based on column name (rename() as well)distinct() eliminates duplicate rows; very helpful if you’ve eliminated some columns. Be very cautious with distinct() if you have used group_by()… unique() also works like this, though I think it ignores groups.group_by() allows you to create groups within your dataset that will be considered and calculated separately, though simultaneously, from other groups. ungroup() releases these groups back to the full dataset.summarize() allows you to take grouped data and aggregate variables.gap_df1 <- gap_df %>%
group_by(continent) %>%
mutate(gdp_total = gdpPercap * pop, ### separate multiple variable mutations using commas
pcgdp_max = max(gdpPercap, na.rm = TRUE)) %>%
ungroup()
# glimpse(gap_df1)
### changing the pop from 'integer' class to 'numeric' class avoids an issue
### in the next step; 'integer' can only go up to about 2 billion, but
### other classes ('numeric', 'double') are more flexible.
### Sometimes the 'factor' class can be incredibly useful, but it can
### create weird issues at other times; let's change the continent and
### country variables to 'character' class which are more intuitive.
gap_df1 <- gap_df1 %>%
mutate(pop = as.numeric(pop),
country = as.character(country),
continent = as.character(continent))
gap_continent_summary <- gap_df1 %>%
group_by(continent, year) %>%
summarize(pcgdp_mean = mean(gdpPercap, na.rm = TRUE) %>% round(2),
### pipe the mean into a round(value, number of digits) function to ditch absurd sig figs
pop_mean = round(mean(pop, na.rm = TRUE), 2),
### or just nest the mean into the round() function
pop_total = sum(pop, na.rm = TRUE),
life_mean = mean(lifeExp, na.rm = TRUE) %>% round(2),
n_countries = n()) %>%
ungroup()
### The DT library lets you use html/javascript interactive tables.
### You may need to first install the DT package.
### The format here allows you to access an installed package without
### first loading the package using library(DT).
### This format (package::function()) can also be very useful when,
### for example, two packages both have a same-named function and you want
### to make sure it uses the one from the correct package.
DT::datatable(gap_continent_summary)Note the mean, sum, median, etc functions will return NA if any of the data points in the group are NA. The na.rm = TRUE removes them before calculating. Note that summarize() collapses each grouping into a single row, and drops any columns that are not explicitly grouped or created. Using mutate() keeps all the rows and columns, and just add new variables as requested (and calculates means/maxes according to groups if any).
ggplot outputs can be stored as an object, which can then be further manipulated by adding more geometries/etc. That object can also be passed to other functions like print or ggplotly, or returned from within a function. If you assign the plot to an object, it suppresses the immediate displaying of the plot, but you can view the plot just by calling the plot object’s name.
pcgdp_plot <- ggplot(gap_continent_summary, aes(x = year, y = pcgdp_mean, color = continent, blargle = pop_total)) +
geom_point() +
labs(title = 'Per capita GDP by continent',
x = 'Year',
y = 'Mean per capita GDP (US$)')
# pcgdp_plot
### you can add additional layers to the plot, e.g. lines. The new geometry
### will use the aesthetics as mapped in the original ggplot() call, unless
### you override them. For a line, we need to tell it a 'group' so it knows
### which data points to connect. For aesthetics that don't change based on
### the data, put those outside the aes() call.
pcgdp_plot_lines <- pcgdp_plot +
geom_line(aes(group = continent), alpha = .3, size = 2)
# pcgdp_plot_lines
### or facets (small multiples, divided out by one of your variables)
pcgdp_facet_plot <- pcgdp_plot +
facet_wrap( ~ continent, scales = 'fixed') # (try scales = 'free' too)
# print(pcgdp_facet_plot)
### The plotly package allows you to display interactive versions
### of ggplot objects using the ggplotly() function.
plotly::ggplotly(pcgdp_plot_lines)As you mouse over the points in the ggplotly plot, notice that pop_total shows up… then notice in the original ggplot() call, I added a nonsense aesthetic and mapped pop_total to that. It doesn’t affect the look of the plot, but recognizes that I want that variable to show up in the pop-up window.
Now let’s see about identifying those outliers we saw earlier, with the very high per-capita GDP in Asia in the early part of the dataset. For each year, and each continent, let’s identify the poorest 25% and richest 25% of countries based on the average per-cap GDP across the entire time span. Then we can plot just the richest and inspect those outliers.
asia_df <- gap_df %>%
group_by(country) %>%
mutate(pcgdp_mean = mean(gdpPercap, na.rm = TRUE)) %>%
group_by(continent) %>%
mutate(cutoff_rich = quantile(pcgdp_mean, .75),
cutoff_poor = quantile(pcgdp_mean, .25)) %>%
ungroup() %>%
mutate(wealth = ifelse(pcgdp_mean <= cutoff_poor, 'poor', 'middle'),
wealth = ifelse(pcgdp_mean >= cutoff_rich, 'rich', wealth)) %>%
filter(continent == 'Asia')
asia_gdp_plot <- ggplot(asia_df %>% filter(wealth == 'rich'),
aes(x = year, y = gdpPercap, color = country)) +
theme_bw() + ### built-in theme to change the look of the plot
geom_point(aes(size = pop)) +
geom_line(aes(group = country), alpha = .3, size = 1)
plotly::ggplotly(asia_gdp_plot)Let’s try a different plot, examining whether per capita GDP and life expectancy are related.
What variables are likely to make a difference here? We can use facet_wrap() and aesthetics to compare how variables affect the relationship. We can use ggplotly() to allow for mouse-over to explore outliers.
gap_df_wealth <- gap_df %>%
group_by(country) %>%
mutate(pcgdp_mean = mean(gdpPercap, na.rm = TRUE)) %>%
group_by(continent) %>%
mutate(cutoff_rich = quantile(pcgdp_mean, .75),
cutoff_poor = quantile(pcgdp_mean, .25)) %>%
ungroup() %>%
mutate(wealth = ifelse(pcgdp_mean <= cutoff_poor, 'poor', 'middle'),
wealth = ifelse(pcgdp_mean >= cutoff_rich, 'rich', wealth))
pcgdp_vs_life_plot <- ggplot(gap_df_wealth %>%
filter(continent == 'Asia') %>%
filter(wealth %in% c('poor', 'rich')),
aes(x = gdpPercap, y = lifeExp, asdf = year)) +
geom_point(aes(size = pop, color = country), alpha = .5) + ### use alpha to better see overlapping data pts
scale_color_discrete(guide = 'none') + ### hides color scale from the legend ('guide')
geom_line(aes(color = country, group = country), size = .5, alpha = .5) +
scale_y_continuous(limits = c(0, NA)) + ### force 0 on y axis
facet_grid( ~ wealth, scales = 'free_x') ### does free_x distort perception of data?
pcgdp_vs_life_plot <- pcgdp_vs_life_plot +
theme_bw() +### themes allow for changing the style of a plot.
labs(title = 'Per-cap GDP vs life expectancy in Asia, 1952-2007')
print(pcgdp_vs_life_plot)There are some clear patterns as we follow the countries through time; but let’s eliminate the time variable and just look at the most recent year of data.
pcgdp_life_2007_plot <- ggplot(gap_df_wealth %>%
filter(year == max(year)),
aes(x = gdpPercap,
y = lifeExp,
blargle = country,
shape = continent)) +
theme_bw() +
geom_point(aes(size = pop, color = wealth), alpha = .7) + ### use alpha to better see overlapping data pts
scale_y_continuous(limits = c(0, NA)) ### force 0 on y axis
print(pcgdp_life_2007_plot)Seems like there is a definite pattern there too, but it is clearly not linear. Let’s see if we can linearize the data a bit, by taking a log transformation (base 10) of the gdpPercap variable. From this we can create a linear model of log(gdp) versus life expectancy. Note that this is looking across all years… so let’s map year to an aesthetic, in case a separate pattern emerges.
gap_df_log <- gap_df_wealth %>%
mutate(log_gdp = log10(gdpPercap))
ggplot(gap_df_log, aes(x = log_gdp, y = lifeExp)) +
geom_point(aes(color = year), alpha = .7) +
scale_color_distiller(type = 'seq', palette = 'Spectral')That seems to create a more linear pattern; note that there is a separate trend where it seems like more recent years are pushed upward relative to older years (which makes sense, right?). Let’s create a linear model based on this, and plot a trend line.
gap_log_model <- lm(formula = lifeExp ~ log_gdp, data = gap_df_log)
summary(gap_log_model)##
## Call:
## lm(formula = lifeExp ~ log_gdp, data = gap_df_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.778 -4.204 1.212 4.658 19.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1009 1.2277 -7.413 1.93e-13 ***
## log_gdp 19.3534 0.3425 56.500 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared: 0.6522, Adjusted R-squared: 0.652
## F-statistic: 3192 on 1 and 1702 DF, p-value: < 2.2e-16
Not a terrible model perhaps? would it be better or worse if we just looked at the most current year? Try it and let me know.
In the mean time let’s plot and add a trend line. Note the linear model output is not a data frame; it’s a different class of object: a list. (so is a ggplot object!). And as such we need to understand a little more about complex data structures in R. This chunk explores the structure (str()) and accesses the coefficients within the structure in a couple of different ways.
str(gap_log_model)## List of 12
## $ coefficients : Named num [1:2] -9.1 19.4
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "log_gdp"
## $ residuals : Named num [1:1704] -18.1 -17 -15.6 -13.4 -10.3 ...
## ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
## $ effects : Named num [1:1704] -2455.08 430.51 -14.77 -12.57 -9.44 ...
## ..- attr(*, "names")= chr [1:1704] "(Intercept)" "log_gdp" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:1704] 46.9 47.3 47.6 47.5 46.4 ...
## ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:1704, 1:2] -41.2795 0.0242 0.0242 0.0242 0.0242 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:1704] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "log_gdp"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.02 1.03
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 1702
## $ xlevels : Named list()
## $ call : language lm(formula = lifeExp ~ log_gdp, data = gap_df_log)
## $ terms :Classes 'terms', 'formula' language lifeExp ~ log_gdp
## .. ..- attr(*, "variables")= language list(lifeExp, log_gdp)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "lifeExp" "log_gdp"
## .. .. .. ..$ : chr "log_gdp"
## .. ..- attr(*, "term.labels")= chr "log_gdp"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(lifeExp, log_gdp)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "log_gdp"
## $ model :'data.frame': 1704 obs. of 2 variables:
## ..$ lifeExp: num [1:1704] 28.8 30.3 32 34 36.1 ...
## ..$ log_gdp: num [1:1704] 2.89 2.91 2.93 2.92 2.87 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language lifeExp ~ log_gdp
## .. .. ..- attr(*, "variables")= language list(lifeExp, log_gdp)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "lifeExp" "log_gdp"
## .. .. .. .. ..$ : chr "log_gdp"
## .. .. ..- attr(*, "term.labels")= chr "log_gdp"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(lifeExp, log_gdp)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "log_gdp"
## - attr(*, "class")= chr "lm"
gap_log_model$coefficients## (Intercept) log_gdp
## -9.100889 19.353423
gap_log_model[['coefficients']]## (Intercept) log_gdp
## -9.100889 19.353423
gap_log_model$coefficients[1] ### the intercept; [2] is the slope based on lifeExp as a function of log_gdp## (Intercept)
## -9.100889
A slope of 19.3534233 means that for every one point increase in log_gdp (which corresponds to a 10x increase in per capita GDP) we get that many additional years of life expectancy!
mdl_int <- gap_log_model$coefficients[1]
mdl_slope <- gap_log_model$coefficients[2]
label_x <- 3
label_y <- mdl_slope * label_x + mdl_int - 5
gap_df_log <- gap_df_log %>%
mutate(wealth = factor(wealth, levels = c('poor', 'middle', 'rich')))
### note the use of factor here, instead of character, to force the order of
### the wealth categories in a categorical variable.
pcgdp_log_plot <- ggplot(gap_df_log, aes(x = log_gdp, y = lifeExp)) +
geom_point(aes(color = wealth, size = pop), alpha = .5) +
scale_color_manual(values = c('poor' = 'coral3', 'middle' = 'cornflowerblue', 'rich' = 'green3')) +
geom_abline(intercept = gap_log_model$coefficients[1],
slope = gap_log_model$coefficients[2],
color = 'red') +
annotate('text', x = label_x, y = label_y,
hjust = 0,
label = sprintf('lifeExp = %.2f * log_gdp + %.2f + ε', mdl_slope, mdl_int))
print(pcgdp_log_plot)I hope this helps with some basics of R, and provides some examples of code to reverse-engineer and use for your own projects.
This little tutorial was created using R Markdown; it is a handy way of communicating analyses in R, especially once you can use GitHub to publish your results directly to the web for others to use and admire.